Final Presentation

5/8/25

Connor Wang

Project 1 - Data Visualization

  • Two TidyTuesday datasets

  • Data on Bob’s Burgers Dialogue from various episodes.

  • Data on how Horror Movies performed in the box office/how much the public enjoyed them.

  • Bob’s Burgers: American animated sitcom with 15 seasons

  • Follows Bob Belcher and his family, who own the restaurant Bob’s Burgers

    # A tibble: 6 × 8
      season episode dialogue_density avg_length sentiment_variance unique_words
       <dbl>   <dbl>            <dbl>      <dbl>              <dbl>        <dbl>
    1      1       1            0.930       37.5               3.32          960
    2      1       2            0.994       33.8               3.99          950
    3      1       3            0.992       31.1               4.08          915
    4      1       4            0.994       32.2               3.71          892
    5      1       5            0.994       34.1               3.78          888
    6      1       6            0.994       33.2               3.30          921
    # ℹ 2 more variables: question_ratio <dbl>, exclamation_ratio <dbl>
  • Explored a possible relationship between the share of lines that ended with questions and the share that ended with exclamations

    ggplot(episode_metrics, aes(x = question_ratio, y = exclamation_ratio)) +
      geom_point(color = "blue") + 
      labs(
        title = "Question Ratio vs. Exclamation Ratio in Bob's Burgers episodes",
        x = "Ratio of lines with questions",
        y = "Ratio of lines with exclamations"
      )

  • No clear trend in the scatterplot: more questions does not mean more excitement in an episode.
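One way to put a number on "no clear trend" is a correlation coefficient. The sketch below uses a small made-up data frame (`toy_metrics` is a hypothetical stand-in for `episode_metrics`, and its values are invented for illustration), so it only shows the mechanics and says nothing about the real episodes.

```r
# Hypothetical stand-in for episode_metrics; values are made up for illustration.
toy_metrics <- data.frame(
  question_ratio    = c(0.10, 0.15, 0.20, 0.25, 0.30),
  exclamation_ratio = c(0.22, 0.18, 0.25, 0.17, 0.21)
)

# Pearson correlation: values near 0 suggest no linear relationship.
r <- cor(toy_metrics$question_ratio, toy_metrics$exclamation_ratio)

# Spearman rank correlation is less sensitive to outlying episodes.
rho <- cor(toy_metrics$question_ratio, toy_metrics$exclamation_ratio,
           method = "spearman")
```

On the real data, the same two calls on `episode_metrics` would give a quick check of what the scatterplot suggests visually.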

  • Horror: a popular genre of movies meant to evoke feelings of fear and terror

  • Since 2017, more than 40 horror movies have been released each year

    # A tibble: 6 × 20
           id original_title   title original_language overview tagline release_date
        <dbl> <chr>            <chr> <chr>             <chr>    <chr>   <date>      
    1  760161 Orphan: First K… Orph… en                After e… There'… 2022-07-27  
    2  760741 Beast            Beast en                A recen… Fight … 2022-08-11  
    3  882598 Smile            Smile en                After w… Once y… 2022-09-23  
    4  756999 The Black Phone  The … en                Finney … Never … 2022-06-22  
    5  772450 Presencias       Pres… es                A man w… <NA>    2022-09-07  
    6 1014226 Sonríe           Sonr… es                <NA>     <NA>    2022-08-18  
    # ℹ 13 more variables: poster_path <chr>, popularity <dbl>, vote_count <dbl>,
    #   vote_average <dbl>, budget <dbl>, revenue <dbl>, runtime <dbl>,
    #   status <chr>, adult <lgl>, backdrop_path <chr>, genre_names <chr>,
    #   collection <dbl>, collection_name <chr>
  • Explored relationship between a movie’s budget and revenue

    ggplot(horror_movies, aes(x = budget, y = revenue)) +
      geom_point(color = "red") +
      labs(
        title = "Horror Movie Budget versus Revenue",
        x = "Budget",
        y = "Revenue"
      )

  • No clear trend in the scatterplot: spending more money does not necessarily mean making more money.
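Budget and revenue often span several orders of magnitude, which can squash most points into one corner of a plot. The sketch below is one possible refinement, not the method used in this project. It assumes that a value of 0 means "not reported" (worth verifying in the actual data) and uses a made-up data frame, `toy_movies`, as a hypothetical stand-in for `horror_movies`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for horror_movies; values are made up for illustration.
toy_movies <- data.frame(
  title   = c("A", "B", "C", "D", "E"),
  budget  = c(0, 1e6, 5e6, 2e7, 0),
  revenue = c(0, 3e6, 0, 9e7, 1e5)
)

# Assumption: a 0 means the figure was not reported, so drop those rows.
reported <- toy_movies %>%
  filter(budget > 0, revenue > 0)

# Log scales spread out values that differ by orders of magnitude.
p <- ggplot(reported, aes(x = budget, y = revenue)) +
  geom_point(color = "red") +
  scale_x_log10() +
  scale_y_log10() +
  labs(
    title = "Horror Movie Budget versus Revenue (log scales)",
    x = "Budget (log scale)",
    y = "Revenue (log scale)"
  )
```

Printing `p` draws the plot. Whether a trend appears on the real data after this filtering is an open question that this sketch does not answer.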

Project 2 - Netflix Data Analysis

  • Netflix is a popular streaming service consisting of TV shows, documentaries, movies, and more.
    # A tibble: 6 × 12
      show_id type    title director    cast  country date_added release_year rating
      <chr>   <chr>   <chr> <chr>       <chr> <chr>   <chr>             <dbl> <chr> 
    1 s1      TV Show 3%    <NA>        João… Brazil  August 14…         2020 TV-MA 
    2 s2      Movie   7:19  Jorge Mich… Demi… Mexico  December …         2016 TV-MA 
    3 s3      Movie   23:59 Gilbert Ch… Tedd… Singap… December …         2011 R     
    4 s4      Movie   9     Shane Acker Elij… United… November …         2009 PG-13 
    5 s5      Movie   21    Robert Luk… Jim … United… January 1…         2008 PG-13 
    6 s6      TV Show 46    Serdar Akar Erda… Turkey  July 1, 2…         2016 TV-MA 
    # ℹ 3 more variables: duration <chr>, listed_in <chr>, description <chr>
  • Created three questions of interest regarding the information from the dataset.

Has movie duration changed over time?

  • Data from the mid 1900s to 2022.

  • Films from multiple countries and genres on Netflix.

  • Filtered the data to include only movies with a runtime of an hour or longer.

    movies_only <- netflix_titles %>%
      filter(type == "Movie") %>%
      mutate(movie_duration = as.numeric(str_extract(duration, "\\d+"))) %>%
      filter(!is.na(duration) & movie_duration >= 60) %>%
      arrange(release_year) %>%
      select(type, release_year, duration, movie_duration)
    
    head(movies_only)
    # A tibble: 6 × 4
      type  release_year duration movie_duration
      <chr>        <dbl> <chr>             <dbl>
    1 Movie         1943 61 min               61
    2 Movie         1943 82 min               82
    3 Movie         1944 76 min               76
    4 Movie         1945 63 min               63
    5 Movie         1954 116 min             116
    6 Movie         1954 120 min             120
    ggplot(movies_only, aes(x = release_year, y = movie_duration)) + 
      geom_point(color = "lightsalmon3") +
      labs(
        title = "Has movie duration changed over time?",
        x = "Year Released",
        y = "Duration of Movie in Minutes (starting at 60)"
      )

  • Found no trend: movie durations have stayed in a similar range over the years.
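A scatterplot with many overlapping points can hide small shifts, so a per-decade summary is one way to double-check the "no trend" reading. This is a minimal sketch on made-up rows (`toy_movies` is a hypothetical stand-in for `movies_only`), and the `decade` column is an illustrative addition.

```r
library(dplyr)

# Hypothetical stand-in for movies_only; values are made up for illustration.
toy_movies <- data.frame(
  release_year   = c(1943, 1954, 1954, 2018, 2019, 2020),
  movie_duration = c(61, 116, 120, 95, 102, 88)
)

# Round each year down to its decade, then summarise the typical runtime.
by_decade <- toy_movies %>%
  mutate(decade = floor(release_year / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    n_movies        = n(),
    median_duration = median(movie_duration)
  ) %>%
  arrange(decade)
```

Including `n_movies` matters because early decades may contain only a handful of titles, which makes their medians unstable.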

Movies and Shows about the future

  • Dataset included a brief description about each piece of media.

  • Analyzed the number of films, TV shows, documentaries, etc. that were about the future categorized by country.

    content_about_future <- netflix_titles %>%
      filter(str_detect(description, "(?i)future")) %>%
      filter(!is.na(country)) %>%
      group_by(country) %>%
      summarise(count = n()) %>%
      arrange(desc(count)) %>%
      slice_head(n = 10)
    
    head(content_about_future)
    # A tibble: 6 × 2
      country        count
      <chr>          <int>
    1 United States     30
    2 India              9
    3 United Kingdom     5
    4 Indonesia          4
    5 Canada             3
    6 South Korea        3

    ggplot(content_about_future, aes(x = country, y = count)) +
      geom_col(fill = "plum3") +
      labs(
        title = "Countries with TV Shows and Movies about the 'future'",
        x = "Country",
        y = "Number of TV Shows/Movies"
      ) +
      theme(axis.text.x = element_text(size = 7))

  • The United States had the most, but likely because most of the media in the dataset is from the US to begin with.

  • Most countries have similar counts; no country stands out as especially creative about the future.
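Because raw counts favor countries with more titles overall, one way to account for catalog size is to compare the share of each country's titles that mention the future. The sketch below is an illustration on made-up rows (`toy_titles` is a hypothetical stand-in for `netflix_titles`, and the column names `total`, `future_count`, and `share` are invented here), so its numbers say nothing about the real dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for netflix_titles; rows are made up for illustration.
toy_titles <- data.frame(
  country = c(rep("United States", 5), rep("India", 2), NA),
  description = c(
    "A look at the future of food.",
    "A family drama.",
    "Two friends go on a road trip.",
    "Robots rule the Future.",
    "A chef opens a diner.",
    "A wedding goes wrong.",
    "In the near future, a city floods.",
    "A future without water."
  )
)

# Share of each country's titles whose description mentions "future".
future_share <- toy_titles %>%
  filter(!is.na(country)) %>%
  mutate(about_future = str_detect(description, "(?i)future")) %>%
  group_by(country) %>%
  summarise(
    total        = n(),
    future_count = sum(about_future),
    share        = future_count / total
  ) %>%
  arrange(desc(share))
```

In this toy example the United States has the larger count but the smaller share, which is the kind of difference a raw count cannot show.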

Words preceding ‘of’

  • Dataset included titles of many different works of media.

  • Analyzed which words most commonly came before ‘of’ in a title.

  • Chose ‘of’ because of how commonly it is used, for example: story of, legend of.

    words <- netflix_titles %>%
      mutate(of = str_extract(str_to_lower(title), "\\b\\w+(?= of)")) %>%
      filter(!is.na(of)) %>%
      group_by(of) %>%
      summarise(n = n()) %>%
      arrange(desc(n)) %>%
      slice_head(n = 10)
    head(words)
    # A tibble: 6 × 2
      of          n
      <chr>   <int>
    1 legend     19
    2 story      19
    3 secrets    16
    4 life       11
    5 tales      11
    6 age         8
    ggplot(words, aes(x = of, y = n)) +
      geom_col(aes(fill = of)) + 
      labs(
        title = "10 Most common words that come before 'of' in TV show/movie titles",
        x = "Word before 'of'",
        y = "Number of titles in which it appears before 'of'"
      )

  • The most common words were, as predicted, ‘legend’ and ‘story’.

  • Similar movie title formats are recycled.
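The lookahead pattern is the core of this analysis, so it may help to see it on a few strings. The sketch below uses made-up titles (not taken from the dataset). It also shows one edge case: `(?= of)` only checks that a space and the letters "of" follow, so a word before "Officer" would match too. The stricter `pattern_strict` is a suggested variant, not the pattern used in the analysis.

```r
library(stringr)

# Made-up titles for illustration; they are not taken from the dataset.
toy_titles <- c(
  "The Legend of the Lake",
  "Story of a Town",
  "Police Officer",
  "Tales of Love and Tales of War"
)

# \\b\\w+ matches a whole word; (?= of) requires " of" to follow
# without including it in the match.
pattern_original <- "\\b\\w+(?= of)"

# Suggested variant: \\b after "of" requires "of" to be a complete word.
pattern_strict <- "\\b\\w+(?= of\\b)"

loose  <- str_extract(str_to_lower(toy_titles), pattern_original)
strict <- str_extract(str_to_lower(toy_titles), pattern_strict)
```

Note that `str_extract()` returns only the first match per title, so a title containing "of" twice is counted once; `str_extract_all()` would return every match.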

Thank You!